Contributor: Liu Wei (刘唯)

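Introduction

VQA is a dataset of open-ended questions about images. Answering these questions requires an understanding of both vision and language. Its main features are:

1. 265,016 images (COCO and abstract scenes)

2. At least 3 questions per image (5.4 questions on average)

3. 10 ground-truth answers per question

4. 3 plausible (but likely incorrect) answers per question

5. An automatic evaluation metric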
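
The automatic evaluation metric is not spelled out here. As a rough illustration, the sketch below uses the commonly cited simplified form of VQA accuracy, in which a predicted answer counts as fully correct if at least 3 of the 10 human annotators gave it: accuracy = min(matching answers / 3, 1). The function name `vqa_accuracy`, the lowercase/whitespace normalization, and the sample answers are assumptions made for illustration. The official evaluation script applies more answer normalization and averages over subsets of annotators, so its scores can differ.

```python
from collections import Counter
from typing import Sequence


def vqa_accuracy(predicted: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1).

    Illustrative sketch only; normalization here is limited to
    stripping whitespace and lowercasing (an assumption).
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


# Hypothetical set of 10 ground-truth answers for one question.
answers = ["yes"] * 2 + ["no"] * 8

print(vqa_accuracy("no", answers))     # 8 annotators agree, capped at full credit
print(vqa_accuracy("yes", answers))    # 2 annotators agree, partial credit
print(vqa_accuracy("maybe", answers))  # no annotator agrees, no credit
```

Capping the score at 1 means a prediction is not penalized when annotators disagree among themselves, as long as at least three of them gave the predicted answer.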

Size

25 GB (compressed archive)

Quantity

265,016 images, with at least 3 questions per image and 10 ground-truth answers per question

URL

http://www.visualqa.org/

Related papers

[1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the Behavior of Visual Question Answering Models. In EMNLP, 2016.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
[4] X. Chen and C. L. Zitnick. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In CVPR, 2015.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[6] J. Devlin, S. Gupta, R. B. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. CoRR, abs/1505.04467, 2015.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In CVPR, 2015.
[8] H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From Captions to Visual Concepts and Back. In CVPR, 2015.
[9] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP, 2016.
[10] H. Gao, J. Mao, J. Zhou, Z. Huang, and A. Yuille. Are you talking to a machine? Dataset and methods for multilingual image question answering. In NIPS, 2015.
[11] Y. Goyal, A. Mohapatra, D. Parikh, and D. Batra. Towards Transparent AI Systems: Interpreting Visual Question Answering Models. In ICML Workshop on Visualization for Deep Learning, 2016.